-
Notifications
You must be signed in to change notification settings - Fork 575
feat: Enhance process_systems to recursively search all paths in systems list #5033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the process_systems function to handle lists of system directory paths by iterating over each path and applying expansion logic individually. Previously, when systems was a list, it was simply copied; now each item is processed through the same expansion logic as single string inputs.
Key changes:
- Modified
process_systemsto iterate over list items and expand each path individually - Updated documentation for better clarity on how the function handles both strings and lists
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deepmd/utils/data_system.py | Refactored process_systems to expand each list item individually, improved docstring clarity, and added explicit type checking with error handling |
| deepmd/utils/argcheck.py | Updated documentation strings to clarify behavior for string and list inputs for both training and validation data |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughUpdated documentation for training and validation data argument descriptions to clarify system path handling. Modified Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant process_systems
participant expand_sys_str
participant rglob_sys_str
rect `#e6f7ff`
Caller->>process_systems: call with systems (str or list)
end
alt input is str
process_systems->>process_systems: normalize -> [str]
end
loop for each path in list
process_systems->>process_systems: decide pattern present?
alt patterns is None
process_systems->>expand_sys_str: expand path
expand_sys_str-->>process_systems: expanded_paths
else patterns present
process_systems->>rglob_sys_str: rglob path with patterns
rglob_sys_str-->>process_systems: matched_paths
end
process_systems->>process_systems: extend result_systems
end
process_systems-->>Caller: return consolidated result_systems (list)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
deepmd/utils/data_system.py (2)
808-817: Consider TypeError and add list item validation.Two suggestions for improvement:
Use TypeError for invalid types: When the input type is incorrect,
TypeErroris more semantically appropriate thanValueError(as suggested by static analysis).Validate list items are strings: Currently, if
systemsis a list containing non-string items (e.g.,["/path", None]or["/path", 123]), the error will occur later inexpand_sys_strorrglob_sys_str, producing a potentially confusing error message.Apply this diff to improve type handling:
# Normalize input to a list of paths to search if isinstance(systems, str): search_paths = [systems] elif isinstance(systems, list): + # Validate all items are strings + for idx, item in enumerate(systems): + if not isinstance(item, str): + raise TypeError( + f"All items in systems list must be str, but systems[{idx}] is {type(item).__name__}." + ) search_paths = systems else: # Handle unsupported input types - raise ValueError( - f"Invalid systems type: {type(systems)}. Must be str or list[str]." + raise TypeError( + f"systems must be str or list[str], got {type(systems).__name__}." )
819-829: Consider deduplicating the final result for edge cases.The current implementation may return duplicate system paths if:
- Multiple paths in the input list overlap (e.g.,
["/data", "/data/subset1"])- The same path appears multiple times in the input list
While
rglob_sys_strdeduplicates within each path search,expand_sys_strdoes not, and duplicates across different search paths are not removed.If deduplication is desired, you could modify the return statement:
result_systems.extend(expanded_paths) -return result_systems +# Deduplicate while preserving order +seen = set() +deduplicated = [] +for system in result_systems: + if system not in seen: + seen.add(system) + deduplicated.append(system) +return deduplicatedAlternatively, if order doesn't matter:
-return result_systems +return list(dict.fromkeys(result_systems)) # Preserves order in Python 3.7+Or simply:
-return result_systems +return list(set(result_systems)) # May change order
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
deepmd/utils/argcheck.py(2 hunks)deepmd/utils/data_system.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run
ruff check .andruff format .before committing changes to Python code
Files:
deepmd/utils/data_system.pydeepmd/utils/argcheck.py
🧬 Code graph analysis (1)
deepmd/utils/data_system.py (1)
deepmd/common.py (2)
expand_sys_str(191-208)rglob_sys_str(211-232)
🪛 Ruff (0.14.2)
deepmd/utils/data_system.py
815-817: Prefer TypeError exception for invalid type
(TRY004)
815-817: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (47)
- GitHub Check: CodeQL analysis (python)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Test C++ (true)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Test C++ (false)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test C++ (true)
- GitHub Check: Test C++ (false)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (3)
deepmd/utils/argcheck.py (2)
2994-2998: LGTM! Clear documentation of systems parameter behavior.The updated documentation accurately describes the new behavior where both string and list inputs undergo recursive directory search, with each list item processed identically to a standalone string input.
3075-3079: LGTM! Consistent documentation across training and validation data.The validation data documentation mirrors the training data documentation, maintaining consistency and clarity.
deepmd/utils/data_system.py (1)
787-807: LGTM! Updated docstring accurately reflects new behavior.The docstring clearly describes the new uniform treatment of both string and list inputs, where each path is recursively searched for system directories.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## devel #5033 +/- ##
==========================================
- Coverage 84.23% 84.23% -0.01%
==========================================
Files 709 709
Lines 70075 70081 +6
Branches 3618 3619 +1
==========================================
+ Hits 59027 59032 +5
- Misses 9881 9883 +2
+ Partials 1167 1166 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
njzjz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the documentation:
deepmd-kit/doc/train/training-advanced.md
Lines 78 to 80 in 7778e2e
| - {ref}`systems <training/training_data/systems>` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`. | |
| - `list`: {ref}`systems <training/training_data/systems>` gives the training data systems. | |
| - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path. |
6d4ea7d to
71921b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/utils/data_system.py (1)
808-817: UseTypeErrorfor invalid type validation.Static analysis correctly suggests using
TypeErrorinstead ofValueErrorwhen validating input types, as this is the more semantically appropriate exception for type-related errors.Apply this diff:
else: - # Handle unsupported input types raise ValueError( - f"Invalid systems type: {type(systems)}. Must be str or list[str]." + f"Invalid systems type: {type(systems)}. Must be str or list[str]." ) + raise TypeError( + f"Invalid systems type: {type(systems)}. Must be str or list[str]." + )Actually, let me fix that diff:
- else: - # Handle unsupported input types - raise ValueError( - f"Invalid systems type: {type(systems)}. Must be str or list[str]." - ) + else: + raise TypeError( + f"Invalid systems type: {type(systems)}. Must be str or list[str]." + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
deepmd/utils/argcheck.py(2 hunks)deepmd/utils/data_system.py(2 hunks)doc/train/training-advanced.md(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run
ruff check .andruff format .before committing changes to Python code
Files:
deepmd/utils/data_system.pydeepmd/utils/argcheck.py
🧬 Code graph analysis (1)
deepmd/utils/data_system.py (1)
deepmd/common.py (2)
expand_sys_str(191-208)rglob_sys_str(211-232)
🪛 LanguageTool
doc/train/training-advanced.md
[uncategorized] ~79-~79: Loose punctuation mark.
Context: ...ey can be a list or a str. - str: {ref}`systems <training/training_data/s...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~80-~80: Loose punctuation mark.
Context: ...or all system subdirectories. - list: {ref}`systems <training/training_data/s...
(UNLIKELY_OPENING_PUNCTUATION)
🪛 Ruff (0.14.3)
deepmd/utils/data_system.py
815-817: Prefer TypeError exception for invalid type
(TRY004)
815-817: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (c-cpp)
🔇 Additional comments (5)
deepmd/utils/argcheck.py (2)
2996-2998: LGTM! Documentation accurately reflects the enhanced behavior.The docstring clearly describes both input variants (string and list) and how each is processed. The documentation is consistent with the implementation changes in
process_systems.
3078-3080: LGTM! Consistent documentation for validation data.The docstring mirrors the training_data_args documentation and accurately describes the behavior for validation data systems.
doc/train/training-advanced.md (1)
79-80: LGTM! User documentation is clear and accurate.The documentation clearly explains both input variants and emphasizes that list items are processed the same way as individual string inputs, which helps users understand the recursive search behavior.
deepmd/utils/data_system.py (2)
819-827: LGTM! Clear iteration logic with proper expansion.The iteration over search_paths with appropriate expansion logic (either
expand_sys_strorrglob_sys_str) correctly implements the enhanced functionality described in the PR objectives.
808-829: Address deduplication inconsistency: expand_sys_str paths are not deduplicated.The review comment raises valid but partially addressed concerns:
Empty list handling: An empty
result_systemslist is properly validated downstream. Whenprocess_systemsreturns an empty list,DeepmdDataSystem.__init__(line 104-105) raisesValueError("No systems provided")with a clear error message. This is handled as designed.Deduplication: Inconsistency exists between code paths:
- When
patternsis notNone:rglob_sys_stralready deduplicates vialist(set(matches))at line 232 ofdeepmd/common.py- When
patternsisNone:expand_sys_strdoes not deduplicate; if multiple paths insearch_pathsyield overlapping directories, duplicates persistFor consistency and robustness, consider adding deduplication at the end of
process_systems(line 829) to handle all scenarios uniformly, or ensureexpand_sys_strdeduplicates when called viaprocess_systems.
Description
This PR modifies the
process_systemsutility function to change how it handles list inputs.Previously, if the
systemsargument was astr, the function would recursively search that path for systems. However, ifsystemswas alist, the function would return the list as-is, assuming it was already a complete list of system paths.This update unifies the logic. The function now treats every string path—whether it's a single
strinput or an item within alist—as a directory to be recursively searched. It also refactors the internal logic to first normalize the input into a list of paths and then process them uniformly, improving code clarity and maintainability.Motivation and Justification
The original implementation's inconsistent handling of
strversuslistinputs caused two significant problems:input.jsonlike"systems": ["/path/to/training_data"], would fail. The function would not search inside/path/to/training_datafor the actual system directories (e.g.,set.000,set.001, etc.)."systems": ["/path/to/dataset_A", "/path/to/dataset_B"].This change solves both problems by ensuring that paths provided in a list are searched recursively, just as a single string path would be.
Benefits
input.json.stror alist[str].Summary by CodeRabbit
Documentation
Improvements